perf: optimize array_remove for scalar needle#22390
Conversation
neilconway
left a comment
There was a problem hiding this comment.
Nice performance win!
| ); | ||
| } | ||
| }; | ||
| let original_data = list_array.values().to_data(); |
There was a problem hiding this comment.
This will be inefficient for sliced arrays.
There was a problem hiding this comment.
I now slice the values to the range actually referenced by the offsets.
That said, I wanted to understand your concern better: when a GenericListArray is sliced, values() returns the full underlying array, and to_data() on it wraps the existing buffer references into ArrayData without copying. So the main downside I could identify is that Capacities::Array(original_data.len()) over-estimates the pre-allocation for sliced inputs. Were you thinking of a different inefficiency, or is the over-allocation what you had in mind?
There was a problem hiding this comment.
The over-allocation was one part, but the bigger concern is calling the distinct kernel on the entire values buffer (see other comment).
| // Iterate only over the positions that need removal using set_indices, | ||
| // which is more efficient than scanning every bit. |
There was a problem hiding this comment.
Might be worth elaborating that the win here is mostly because we expect the # of values-to-remove is a lot smaller than the total array size, which it usually (but not always) will be.
| let keep_mask = | ||
| arrow_ord::cmp::distinct(list_array.values(), &Scalar::new(Arc::clone(needle)))?; |
There was a problem hiding this comment.
This will call the distinct kernel on all the elements in the value buffer, not just the ones that are visible in a sliced array.
| ); | ||
| } | ||
| }; | ||
| let original_data = list_array.values().to_data(); |
There was a problem hiding this comment.
The over-allocation was one part, but the bigger concern is calling the distinct kernel on the entire values buffer (see other comment).
neilconway
left a comment
There was a problem hiding this comment.
Awesome, nice work on this! lgtm. Maybe update the benchmark numbers in the PR description if you get a chance.
|
Thanks @neilconway! Really appreciate the thorough and speedy review 🙏 Great suggestion — running benchmarks now and will update the pr description with fresh numbers shortly. |
| let mut offsets = Vec::<OffsetSize>::with_capacity(list_array.len() + 1); | ||
| offsets.push(OffsetSize::zero()); | ||
|
|
||
| let mut mutable = MutableArrayData::with_capacities( |
There was a problem hiding this comment.
I wonder if an approach using take kernel could provide even more performance gains?
There was a problem hiding this comment.
I benchmarked both take and filter kernel approaches against the current MutableArrayData::extend path. Overall, MutableArrayData performs best — on small lists (size=10) take is faster (~10%), possibly by avoiding MutableArrayData initialization overhead, but on medium/large lists (size≥100) MutableArrayData tends to win decisively (take is 60–170% slower depending on type). For variable-length types (strings), the gap appears to widen further.
One possible explanation is that take performs per-index random access for each element, whereas MutableArrayData may instead execute contiguous memcpy operations over memory regions?
There was a problem hiding this comment.
But, the current benchmarks use 10% needle density (sparse removals / high retention), which is likely the most common case? For dense removal workloads the trade-offs may shift, take or filter could become competitive when there are fewer contiguous ranges to memcpy.
There was a problem hiding this comment.
That makes sense, especially as we're operating on list arrays anyway
|
run benchmark array_remove |
|
🤖 Criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing perf/remove (4805bb7) to 265473e (merge-base) diff File an issue against this benchmark runner |
|
🤖 Criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
@Jefffrey Addressed all comments. Thank you for the detailed review! |
|
Thanks @lyne7-sc & @neilconway |
Which issue does this PR close?
Rationale for this change
Similar to #22387 (array_replace scalar optimization)
array_remove/array_remove_n/array_remove_allperform element-wise comparison by invokingcompare_element_to_listagainst each row's sub-array individually. When the needle is a scalar, this can be optimized by performing a single vectorizeddistinctcomparison over the entire flattened values buffer.What changes are included in this PR?
general_remove_with_scalar) that usesarrow_ord::cmp::distinctwithScalarwrapper for a single bulk comparison pass over the flat values buffer.nvalues, and LargeList type coverage.Benchmarks
Are these changes tested?
Yes, existing and new SLT edge-case tests in
array_remove.slt.Are there any user-facing changes?
No.